12 research outputs found

    POSE: getting over grainsize in parallel discrete event simulation

    Parallel discrete event simulations (PDES) encompass a broad range of analytical simulations. Their utility lies in their ability to model a system and provide information about its behavior in a timely manner. Current PDES methods provide limited performance improvements over sequential simulation. Many logical models for applications have fine granularity, making them challenging to parallelize. In POSE, we examine the overhead required for optimistically synchronizing events. We have designed an object model based on the concept of virtualization and new adaptive optimistic methods to improve the performance of fine-grained PDES applications. These novel approaches exploit the speculative nature of optimistic protocols to improve single-processor parallel performance over sequential performance and achieve scalability for previously hard-to-parallelize fine-grained simulations.
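
    Since the abstract centers on the cost of optimistic synchronization, a minimal sketch of that mechanism may help: a logical process applies events speculatively, checkpoints state before each one, and rolls back and replays when a straggler (late event) arrives. All names here (OptimisticLP, Event) are hypothetical C++ stand-ins under assumed semantics, not POSE's actual API.

        #include <cstdio>
        #include <map>
        #include <vector>

        struct Event { double ts; int delta; };

        class OptimisticLP {                        // one logical process
            int state = 0;
            double lvt = 0.0;                       // local virtual time
            std::map<double, int> ckpt;             // snapshot before each event
            std::vector<Event> done;                // events applied so far
            void apply(const Event& e) {
                ckpt[e.ts] = state;                 // per-event checkpoint: the
                state += e.delta;                   // overhead POSE tries to shrink
                lvt = e.ts;
                done.push_back(e);
            }
        public:
            void recv(const Event& e) {
                std::vector<Event> redo;
                while (!done.empty() && done.back().ts > e.ts) {  // rollback
                    redo.push_back(done.back());
                    done.pop_back();
                }
                if (!redo.empty()) state = ckpt[redo.back().ts];  // restore state
                apply(e);                                          // straggler first,
                for (auto it = redo.rbegin(); it != redo.rend(); ++it)
                    apply(*it);                                    // then re-execute
            }
            int value() const { return state; }
        };

        int main() {
            OptimisticLP lp;
            Event a{1.0, 10}, c{3.0, 30}, b{2.0, 20};  // b arrives late: a straggler
            lp.recv(a); lp.recv(c); lp.recv(b);
            std::printf("state=%d (matches timestamp order)\n", lp.value());
        }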

    Highly scalable parallel sorting

    Sorting is a commonly used process with a wide breadth of applications in the high performance computing field. Early research in parallel processing has provided us with comprehensive analysis and theory for parallel sorting algorithms. However, modern supercomputers have advanced rapidly in size and changed significantly in architecture, forcing new adaptations to these algorithms. To fully utilize the potential of highly parallel machines, tens of thousands of processors are used. Efficiently scaling parallel sorting on machines of this magnitude is inhibited by the communication-intensive problem of migrating large amounts of data between processors. The challenge is to design a highly scalable sorting algorithm that uses minimal communication, maximizes overlap between computation and communication, and uses memory efficiently. This paper presents a scalable extension of the Histogram Sorting method, making fundamental modifications to the original algorithm in order to minimize message contention and exploit overlap. We implement Histogram Sort, Sample Sort, and Radix Sort in CHARM++ and compare their performance. The choice of algorithm as well as the importance of the optimizations is validated by performance tests on two predominant modern supercomputer architectures: XT4 at ORNL (Jaguar) and Blue Gene/P at ANL (Intrepid).
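
    The key step of Histogram Sort is iteratively refining splitter guesses against a global histogram until every bucket holds close to n/P keys. The toy sequential rendition below conveys that refinement loop; it is an assumption-laden sketch, not the paper's CHARM++ implementation, and the key range and tolerance are invented.

        #include <algorithm>
        #include <cstdio>
        #include <cstdlib>
        #include <vector>

        // Count how many keys fall below each probe value. In the parallel
        // algorithm this is a reduction over every processor's local counts.
        std::vector<long> histogram(const std::vector<int>& sorted,
                                    const std::vector<int>& probes) {
            std::vector<long> counts;
            for (int p : probes)
                counts.push_back(std::lower_bound(sorted.begin(), sorted.end(), p)
                                 - sorted.begin());
            return counts;
        }

        int main() {
            const int P = 4;                                // "processor" count
            std::vector<int> keys(100000);
            for (auto& k : keys) k = std::rand() % 1000000; // synthetic keys
            std::sort(keys.begin(), keys.end());            // local sort stand-in
            long n = keys.size(), tol = n / (100 * P);      // allowed imbalance

            std::vector<int> lo(P - 1, 0), hi(P - 1, 1000000), probes(P - 1);
            for (int round = 0; round < 40; ++round) {      // few rounds in practice
                for (int i = 0; i < P - 1; ++i) probes[i] = lo[i] + (hi[i] - lo[i]) / 2;
                auto counts = histogram(keys, probes);      // one collective per round
                bool done = true;
                for (int i = 0; i < P - 1; ++i) {
                    long ideal = (i + 1) * n / P;           // target rank of splitter i
                    if (counts[i] < ideal - tol)      { lo[i] = probes[i] + 1; done = false; }
                    else if (counts[i] > ideal + tol) { hi[i] = probes[i] - 1; done = false; }
                }
                if (done) break;
            }
            for (int i = 0; i < P - 1; ++i) std::printf("splitter %d: %d\n", i, probes[i]);
        }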

    Optimizing an MPI weather forecasting model via processor virtualization

    Weather forecasting models are computationally intensive applications. These models are typically executed on parallel machines, and a major obstacle to their scalability is load imbalance. The causes of such imbalance are either static (e.g. topography) or dynamic (e.g. shortwave radiation, moving thunderstorms). Various techniques, often embedded in the application's source code, have been used to address both sources. However, these techniques are inflexible and hard to use in legacy codes. In this paper, we demonstrate the effectiveness of processor virtualization for dynamically balancing the load in BRAMS, a mesoscale weather forecasting model based on MPI parallelization. We use the Charm++ infrastructure, with its over-decomposition and object-migration capabilities, to move sub-domains across processors during execution of the model. Processor virtualization enables better overlap between computation and communication and improved cache efficiency. Furthermore, by employing an appropriate load balancer, we achieve better processor utilization while requiring minimal changes to the model's code.
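
    The over-decomposition idea in this abstract is easy to miniaturize: create many more sub-domains than processors and let a balancer reassign them by measured load. The greedy strategy below is a generic stand-in under invented loads, not BRAMS's or Charm++'s actual balancer.

        #include <algorithm>
        #include <cstdio>
        #include <functional>
        #include <queue>
        #include <vector>

        int main() {
            const int procs = 4, chunks = 32;            // 8x over-decomposition
            std::vector<double> load(chunks);
            for (int i = 0; i < chunks; ++i)             // uneven sub-domain costs,
                load[i] = 1.0 + (i % 5) * 0.5;           // e.g. storms over some cells

            // Greedy rebalance: heaviest remaining sub-domain goes to the
            // currently least-loaded processor, mimicking migration of
            // over-decomposed objects using measured loads.
            std::sort(load.rbegin(), load.rend());
            using Entry = std::pair<double, int>;        // (assigned load, processor)
            std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
            for (int p = 0; p < procs; ++p) pq.push({0.0, p});
            std::vector<double> total(procs, 0.0);
            for (double l : load) {
                auto [t, p] = pq.top();
                pq.pop();
                total[p] = t + l;
                pq.push({total[p], p});
            }
            for (int p = 0; p < procs; ++p)
                std::printf("processor %d: load %.1f\n", p, total[p]);
        }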

    Achieving strong scaling with NAMD on Blue Gene/L

    NAMD is a scalable molecular dynamics application, which has demonstrated its performance on several parallel computer architectures. Strong scaling is necessary for molecular dynamics, as the problem size is fixed and a large number of iterations must be executed to understand interesting biological phenomena. The Blue Gene/L machine is a massive source of compute power: it consists of tens of thousands of embedded PowerPC 440 processors. In this paper, we present several techniques to scale NAMD to 8192 processors of Blue Gene/L. These include topology-specific optimizations, new messaging protocols, load balancing, and overlap of computation and communication. We were able to achieve 1.2 TF of peak performance for cutoff simulations and 0.99 TF with PME.
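
    Of the techniques listed, overlap of computation and communication is the easiest to sketch in isolation: start the exchange, do work that does not depend on it, then consume the result. std::async below is only a stand-in for Charm++'s asynchronous messaging, and the halo/interior split is a hypothetical example.

        #include <cstdio>
        #include <future>
        #include <numeric>
        #include <vector>

        // Stand-in for a nonblocking message exchange of boundary data.
        std::vector<double> exchange_halo(std::vector<double> halo) {
            return halo;
        }

        int main() {
            std::vector<double> interior(1000, 1.0), halo(16, 2.0);
            // Start the "communication" first...
            auto pending = std::async(std::launch::async, exchange_halo, halo);
            // ...then do local work that does not need the halo...
            double local = std::accumulate(interior.begin(), interior.end(), 0.0);
            // ...and only block once the remote data is actually needed.
            auto recvd = pending.get();
            double remote = std::accumulate(recvd.begin(), recvd.end(), 0.0);
            std::printf("total %.1f\n", local + remote);
        }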

    Hierarchical Load Balancing for Charm++ Applications on Large Supercomputers

    Large parallel machines with hundreds of thousands of processors are being built. Recent studies have shown that ensuring good load balance is critical for scaling certain classes of parallel applications, even on thousands of processors. Centralized load balancing algorithms suffer from scalability problems, especially on machines with a relatively small amount of memory. Fully distributed load balancing algorithms, on the other hand, tend to yield poor load balance on very large machines. In this paper, we present an automatic dynamic hierarchical load balancing method that overcomes the scalability challenges of centralized schemes and the poor solutions of traditional distributed schemes. This is done by creating multiple levels of aggressive load balancing domains which form a tree. This hierarchical method is demonstrated within a measurement-based load balancing framework in Charm++. We present techniques to deal with the scalability challenges of load balancing at very large scale. We show performance data for the hierarchical load balancing method on up to 16,384 cores of Ranger (at TACC) for a synthetic benchmark. We also demonstrate the successful deployment of the method in a scientific application, NAMD, with results on the Blue Gene/P machine at ANL.
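
    A minimal rendering of the hierarchical idea: per-core loads are aggregated within each domain, and the root reasons only about domain-level totals, which is what keeps the centralized step scalable. The two-level sketch below, with invented loads, is hypothetical; the paper's framework builds a deeper tree from measured Charm++ object loads.

        #include <cmath>
        #include <cstdio>
        #include <numeric>
        #include <vector>

        int main() {
            // Four load balancing domains, each the root of a subtree of cores.
            std::vector<std::vector<double>> cores = {
                {1.0, 1.4, 0.9, 1.5}, {0.6, 0.8, 0.7, 0.9},
                {1.8, 1.2, 1.6, 1.4}, {0.4, 0.5, 0.6, 0.5}};

            std::vector<double> domain;
            for (auto& g : cores)                          // level 1: aggregate
                domain.push_back(std::accumulate(g.begin(), g.end(), 0.0));

            // Level 2: the tree root sees one number per domain, so its work
            // stays constant no matter how many cores each subtree hides.
            double avg = std::accumulate(domain.begin(), domain.end(), 0.0)
                         / domain.size();
            for (size_t d = 0; d < domain.size(); ++d)
                std::printf("domain %zu: %s %.2f units of load\n", d,
                            domain[d] > avg ? "export" : "import",
                            std::fabs(domain[d] - avg));
        }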

    Energy-efficient computing for HPC workloads on heterogeneous manycore chips

    Power and energy efficiency are among the major challenges to achieving exascale computing in the next several years. While chips operating at low voltages have been shown to be highly energy-efficient, low-voltage operation leads to heterogeneity across cores within the microprocessor chip. In this work, we study chips with low-voltage operation and discuss programming systems and performance modeling in the presence of heterogeneity. We propose an integer linear programming based approach for selecting the optimal configuration of a chip that minimizes its energy consumption. We obtain average savings of 26% and 10.7% in the chip's energy consumption for two HPC mini-applications, miniMD and Jacobi, respectively. We also evaluate the energy savings under execution time constraints using the proposed approach. These energy savings are significantly greater than the savings from sub-optimal configurations obtained from heuristics.
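
    The core mechanism here, choosing a chip configuration that minimizes energy subject to a time constraint, can be shown with a toy search. The paper formulates this as an integer linear program; the sketch below just enumerates a hypothetical three-point configuration space with made-up power and time numbers to convey the same objective and constraint.

        #include <cstdio>
        #include <vector>

        // A candidate chip configuration: a mix of nominal-voltage ("fast")
        // and low-voltage ("slow") cores, with modeled power and runtime.
        struct Config { int fastCores, slowCores; double power, time; };

        int main() {
            double deadline = 12.0;                     // execution time constraint
            std::vector<Config> space = {
                {8, 0, 100.0, 10.0},   // all cores at nominal voltage
                {4, 4,  70.0, 11.5},   // mixed: some near-threshold cores
                {0, 8,  45.0, 16.0},   // all low voltage: frugal but too slow
            };
            const Config* best = nullptr;
            for (const auto& c : space) {
                if (c.time > deadline) continue;        // infeasible: violates deadline
                double energy = c.power * c.time;       // objective: minimize E = P*t
                if (!best || energy < best->power * best->time) best = &c;
            }
            if (best)
                std::printf("pick %d fast + %d slow cores, energy %.0f J\n",
                            best->fastCores, best->slowCores,
                            best->power * best->time);
        }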

    Scaling Hierarchical N-body Simulations on GPU Clusters

    This paper focuses on the use of GPGPU-based clusters for hierarchical N-body simulations. Whereas the behavior of these hierarchical methods has been studied in the past on CPU-based architectures, we investigate key performance issues in the context of clusters of GPUs. These include kernel organization and efficiency, the balance between tree traversal and force computation work, grain size selection through the tuning of offloaded work request sizes, and the reduction of sequential bottlenecks. The effects of various application parameters are studied, and experiments are done to quantify gains in performance. Our studies are carried out in the context of a production-quality parallel cosmological simulator called ChaNGa. We highlight the re-engineering of the application to make it more suitable for GPU-based environments. Finally, we present performance results from experiments on the NCSA Lincoln GPU cluster, including a note on GPU use in multistepped simulations.
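
    One tunable the abstract highlights is grain size, controlled through the size of offloaded work requests; the buffering pattern below illustrates it. The class, names, and fixed request size are hypothetical, not ChaNGa's API.

        #include <cstdio>
        #include <vector>

        struct WorkRequest { std::vector<int> interactions; };

        class Offloader {
            size_t requestSize;                 // the tunable grain size
            WorkRequest current;
        public:
            explicit Offloader(size_t n) : requestSize(n) {}
            void add(int interaction) {
                current.interactions.push_back(interaction);
                if (current.interactions.size() >= requestSize) flush();
            }
            void flush() {
                if (current.interactions.empty()) return;
                // Stand-in for an asynchronous kernel launch: larger requests
                // keep the GPU busy, smaller ones overlap better with the
                // ongoing tree walk on the CPU.
                std::printf("launch kernel with %zu interactions\n",
                            current.interactions.size());
                current.interactions.clear();
            }
        };

        int main() {
            Offloader off(4);                   // grain size, tuned per machine
            for (int i = 0; i < 10; ++i) off.add(i);
            off.flush();                        // drain the partial tail request
        }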

    Simulating Large Scale Parallel Applications Using Statistical Models for Sequential Execution Blocks

    Predicting the sequential execution blocks of a large-scale parallel application is an essential part of accurately predicting the overall performance of the application. When simulating a future machine that has not yet been fabricated, or a prototype system available only at small scale, this becomes a significant challenge. Using hardware simulators may not be feasible due to excessively slowed-down execution times and insufficient resources. These challenges become increasingly difficult in proportion to the scale of the simulation. In this paper, we propose an approach based on statistical models to accurately predict the performance of the sequential execution blocks that comprise a parallel application. We deployed these techniques in a trace-driven simulation framework to capture both the detailed behavior of the application and the overall predicted performance. The technique is validated using both synthetic benchmarks and the NAMD application. Index Terms: parallel simulator, performance prediction, trace-driven, machine learning, statistical model
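
    The approach reduces to fitting a model of each sequential block's time from its parameters on an available machine, then replaying traces with predicted durations. A one-variable least-squares fit, with invented sample data, sketches that statistical step; the paper's models are richer.

        #include <cstdio>
        #include <vector>

        int main() {
            // (input size, measured time) pairs for one sequential block, as if
            // taken from instrumented runs on a small prototype (numbers invented).
            std::vector<double> x = {100, 200, 400, 800};
            std::vector<double> y = {1.1, 2.1, 3.9, 8.2};
            double n = x.size(), sx = 0, sy = 0, sxx = 0, sxy = 0;
            for (size_t i = 0; i < x.size(); ++i) {
                sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i];
            }
            // Ordinary least squares: time ~= intercept + slope * size.
            double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
            double intercept = (sy - slope * sx) / n;
            // A trace-driven simulator would substitute this prediction for the
            // block's measured duration when replaying at the target scale.
            std::printf("predicted time at size 1600: %.2f\n",
                        intercept + slope * 1600);
        }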

    Automatic MPI to AMPI Program Transformation Using Photran

    Adaptive MPI, or AMPI, is an implementation of the Message Passing Interface (MPI) standard. AMPI benefits MPI applications with features such as dynamic load balancing, virtualization, and checkpointing. Because AMPI uses multiple user-level threads per physical core, global variables become an obstacle. It is thus necessary to convert MPI programs to AMPI by eliminating global variables. Manually removing the global variables in the program is tedious and error-prone. In this paper, we present a Photran-based tool that automates this task with a source-to-source transformation that supports Fortran. We evaluate our tool on the multi-zone NAS Benchmarks with AMPI. We also demonstrate the tool on a real-world large-scale FLASH code and present preliminary results of running FLASH on AMPI. Both results show significant performance improvement using AMPI. This demonstrates that the tool makes using AMPI easier and more productive.
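
    The transformation the tool automates is privatization of globals. The tool itself rewrites Fortran via Photran; the C++ analogue below only illustrates the before/after pattern with hypothetical names, since shared mutable globals break isolation between user-level threads in the same process.

        #include <cstdio>

        // Before: a process-wide global that every AMPI user-level thread in
        // the process would share, corrupting per-rank state.
        // int iterationCount = 0;

        // After: the state is packed into a type owned by each virtual rank.
        struct RankState { int iterationCount = 0; };

        void do_step(RankState& s) {        // state now passed explicitly
            s.iterationCount++;
        }

        int main() {
            RankState rank0, rank1;         // two virtual ranks, isolated state
            do_step(rank0); do_step(rank0); do_step(rank1);
            std::printf("rank0=%d rank1=%d\n",
                        rank0.iterationCount, rank1.iterationCount);
        }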

    Efficient 'Cool Down' of Parallel Applications

    As we move to exascale machines, both peak power and total energy consumption have become prominent challenges. There has been a lot of research on saving machine energy consumption for HPC data centers. However, a significant part of the energy consumption of HPC data centers can be attributed to cooling the machine room. In previous work, we have already shown a significant reduction in cooling energy consumption by constraining core temperatures. In this work, we strive to save machine energy consumption while constraining core temperatures, in order to provide a total energy solution for HPC data centers that saves both machine and cooling energy. Our approach uses Dynamic Voltage and Frequency Scaling (DVFS) to constrain core temperatures and is particularly designed to reduce the timing penalty associated with DVFS. Using a heuristic that exploits the difference in frequency sensitivity for different parts of an application, we present results that show a 17% reduction in machine energy consumption with as little as a 0.9% increase in execution time while constraining core temperatures below 60 °C.
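
    The heuristic's premise is that application phases differ in frequency sensitivity: throttling a memory-bound phase costs little time, so DVFS can hold the temperature cap with a small penalty. The selection logic below is a hypothetical sketch with made-up frequencies, sensitivities, and threshold, not the paper's controller.

        #include <cstdio>
        #include <vector>

        // Frequency sensitivity of a phase: how strongly its runtime depends
        // on core frequency (near 0 for memory-bound, near 1 for CPU-bound).
        struct Phase { const char* name; double sensitivity; };

        double pickFrequency(double temp, double limit, const Phase& p) {
            if (temp < limit) return 2.4;          // GHz: no throttling needed
            // Hot core: throttle insensitive phases hard (small time penalty),
            // sensitive phases only gently.
            return p.sensitivity < 0.3 ? 1.2 : 2.0;
        }

        int main() {
            std::vector<Phase> phases = {{"halo exchange", 0.1}, {"compute", 0.9}};
            for (const auto& p : phases)
                std::printf("%s -> %.1f GHz\n", p.name,
                            pickFrequency(61.0, 60.0, p)); // core 61 C, cap 60 C
        }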